The main dataset used in this project is an open-source dataset containing property registration data for the Icelandic housing market, downloaded from this source: https://www.fasteignaskra.is/gogn/grunngogn-til-nidurhals. It contains historical purchase prices, property addresses, postal codes, square meters, number of rooms, type of housing, etc. To obtain geographical data, a supplementary dataset was needed containing latitude and longitude values for each property; merging these two datasets together produced the final dataset, which ended up being 75.4 MB in size.
The idea was to explore the Icelandic housing market. It is a niche market, but as the team members are from Iceland we thought it would be a great fit: the housing market is always relevant, the ideas of our project can be reused for other housing market data, and we hope to give insights to potential buyers, since for most people a home is the main investment of their life.
The goal was to create a compelling and insightful story of the Icelandic housing market: price changes over the years, best buys, and in general fun and interesting things to look at, e.g. which property has been bought and sold most often over the span of 18 years. The end goal is that a reader with no previous experience of the Icelandic housing market gains insights and a high-level overview of the market as a whole.
To better understand the dataset, we utilized the set of tools acquired during the course. The sections Data cleaning and preprocessing and Data stats take you through it. We also include a short subsection called Other datasets where we discuss the supplementary datasets to our main dataset, where we got them and what insights they can give to our main dataset.
We started by importing the necessary packages, reading in the datasets, merging them into a single dataset, and taking a closer look at the data we had.
# Data manipulation
import pandas as pd
import numpy as np
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium import Map, Html, Element
from branca.colormap import LinearColormap
import plotly.express as px
from bokeh.plotting import figure, show, save, output_file
from bokeh.models import ColumnDataSource, Legend
from bokeh.io import output_notebook, reset_output
from bokeh.palettes import Category20 as palette # Import a palette with sufficient colors
# Machine Learning
from sklearn.neighbors import BallTree
# Miscellaneous
import itertools # For cycling through color palette
import zipfile
import json
# Configuration for Jupyter Notebooks
%matplotlib inline
sns.set_theme(style="white") # Set the default seaborn style to 'white'
%config InlineBackend.figure_format = 'retina' # For higher resolution figures
import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)
import plotly.io as pio # from IPython.display import Image
output_notebook()
# Pandas configuration
pd.set_option('display.max_columns', None) # Display any number of columns
# Paths to the datasets
kaupskra_path = 'kaupskra.csv.zip'
stadfangaskra_path = 'stadfangaskra.csv.zip'
# Reading in kaupskra and stadfangaskra data
kaupskra = pd.read_csv(kaupskra_path, encoding='latin1', sep=';')
stadfangaskra = pd.read_csv(stadfangaskra_path, encoding='utf-8', sep=',')
# Let's join kaupskra and stadfangaskra together on HEINUM to form our final dataset
data = pd.merge(kaupskra, stadfangaskra, on='HEINUM') # inner join
data.shape
(193005, 52)
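As a side note on the merge above: `pd.merge` defaults to an inner join, so any record whose HEINUM key is missing from either file silently drops out of the final dataset. A tiny toy example (made-up frames, not the project data) illustrates this:

```python
import pandas as pd

# Two hypothetical frames sharing the HEINUM key
left = pd.DataFrame({'HEINUM': [1, 2, 3], 'price': [10, 20, 30]})
right = pd.DataFrame({'HEINUM': [2, 3, 4], 'lat': [64.1, 64.2, 64.3]})

# Inner join (the default): only keys present in BOTH frames survive
merged = pd.merge(left, right, on='HEINUM')
# merged keeps HEINUM 2 and 3 only; keys 1 and 4 are dropped
```

If unmatched rows mattered, `how='left'` or the `indicator=True` parameter would reveal them.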
# Taking a quick look at the data we have
data.head()
| FAERSLUNUMER | EMNR | SKJALANUMER | FASTNUM | HEIMILISFANG | POSTNR_x | HEINUM | SVFN | SVEITARFELAG | UTGDAG | THINGLYSTDAGS | KAUPVERD | FASTEIGNAMAT | FASTEIGNAMAT_GILDANDI | BRUNABOTAMAT_GILDANDI | BYGGAR | FEPILOG | EINFLM | LOD_FLM | LOD_FLMEIN | FJHERB | TEGUND | FULLBUID | ONOTHAEFUR_SAMNINGUR | FID | HNITNUM | SVFNR | BYGGD | LANDNR | MATSNR | POSTNR_y | HEITI_NF | HEITI_TGF | HUSNR | BOKST | VIDSK | SERHEITI | DAGS_INN | DAGS_LEIDR | GAGNA_EIGN | TEGHNIT | YFIRFARID | YFIRF_HEITI | ATH | NAKV_XY | HNIT | N_HNIT_WGS84 | E_HNIT_WGS84 | NOTNR | LM_HEIMILISFANG | VEF_BIRTING | HUSMERKING | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 500342 | 411 | R-005069/2006 | 2067729 | Melabraut 3 | 170 | 1024617 | 1100 | Seltjarnarnesbær | 2006-05-08 00:00:00.0 | 2006-05-10 09:22:39.0 | 25500 | 19610 | 90600.0 | 53950.0 | 1963 | 010101 | 102.3 | NaN | NaN | 3.0 | Fjölbýli | 1 | 1 | Stadfangaskra.fid-24f8547a_18ebb06342d_-3298 | 10032272 | 1100 | 1 | 117368 | NaN | 170 | Melabraut | Melabraut | 3.0 | NaN | NaN | NaN | 2007-09-07 | 2009-02-23 | HMS | 0.0 | 0.0 | NaN | NaN | NaN | POINT (354080 408959) | 64.152289 | -21.999498 | 956.0 | Melabraut 3 (117368) | Melabraut 3 (117368) | 3 |
| 1 | 679963 | 441 | D-006418/2021 | 2067730 | Melabraut 3 | 170 | 1024617 | 1100 | Seltjarnarnesbær | 2021-10-21 00:00:00.0 | 2021-10-25 15:18:50.0 | 50000 | 62250 | 100400.0 | 67900.0 | 1963 | 010201 | 139.8 | 820.0 | m² | 2.0 | Fjölbýli | 1 | 0 | Stadfangaskra.fid-24f8547a_18ebb06342d_-3298 | 10032272 | 1100 | 1 | 117368 | NaN | 170 | Melabraut | Melabraut | 3.0 | NaN | NaN | NaN | 2007-09-07 | 2009-02-23 | HMS | 0.0 | 0.0 | NaN | NaN | NaN | POINT (354080 408959) | 64.152289 | -21.999498 | 956.0 | Melabraut 3 (117368) | Melabraut 3 (117368) | 3 |
| 2 | 680332 | 441 | A-009807/2021 | 2067729 | Melabraut 3 | 170 | 1024617 | 1100 | Seltjarnarnesbær | 2021-11-01 00:00:00.0 | 2021-11-04 15:14:48.0 | 44000 | 48850 | 90600.0 | 53950.0 | 1963 | 010101 | 102.3 | 820.0 | m² | 3.0 | Fjölbýli | 1 | 0 | Stadfangaskra.fid-24f8547a_18ebb06342d_-3298 | 10032272 | 1100 | 1 | 117368 | NaN | 170 | Melabraut | Melabraut | 3.0 | NaN | NaN | NaN | 2007-09-07 | 2009-02-23 | HMS | 0.0 | 0.0 | NaN | NaN | NaN | POINT (354080 408959) | 64.152289 | -21.999498 | 956.0 | Melabraut 3 (117368) | Melabraut 3 (117368) | 3 |
| 3 | 702045 | 441 | B-005352/2023 | 2526176 | Melabraut 3 | 170 | 1024617 | 1100 | Seltjarnarnesbær | 2023-06-27 00:00:00.0 | 2023-08-03 12:17:21.0 | 119900 | 10200 | 99350.0 | 67950.0 | 010301 | 129.5 | 820.0 | m² | 2.0 | Fjölbýli | 1 | 1 | Stadfangaskra.fid-24f8547a_18ebb06342d_-3298 | 10032272 | 1100 | 1 | 117368 | NaN | 170 | Melabraut | Melabraut | 3.0 | NaN | NaN | NaN | 2007-09-07 | 2009-02-23 | HMS | 0.0 | 0.0 | NaN | NaN | NaN | POINT (354080 408959) | 64.152289 | -21.999498 | 956.0 | Melabraut 3 (117368) | Melabraut 3 (117368) | 3 | |
| 4 | 705839 | 441 | B-010279/2023 | 2067729 | Melabraut 3 | 170 | 1024617 | 1100 | Seltjarnarnesbær | 2023-11-15 00:00:00.0 | 2023-11-25 14:57:33.0 | 86500 | 84300 | 90600.0 | 53950.0 | 2023 | 010101 | 107.2 | 820.0 | m² | 3.0 | Fjölbýli | 1 | 1 | Stadfangaskra.fid-24f8547a_18ebb06342d_-3298 | 10032272 | 1100 | 1 | 117368 | NaN | 170 | Melabraut | Melabraut | 3.0 | NaN | NaN | NaN | 2007-09-07 | 2009-02-23 | HMS | 0.0 | 0.0 | NaN | NaN | NaN | POINT (354080 408959) | 64.152289 | -21.999498 | 956.0 | Melabraut 3 (117368) | Melabraut 3 (117368) | 3 |
The dataset contains 193,005 property registration records with 52 columns of different data, some more relevant and useful than others. By reading the descriptions of these columns we were able to identify the most important features. Here are links to the descriptions; they are in Icelandic, so Google Translate comes in handy: https://www.fasteignaskra.is/gogn/grunngogn-til-nidurhals/kaupskra-fasteigna/eigindalysing-kaupskrar & https://www.fasteignaskra.is/library/Samnyttar-skrar-/Fyrirtaeki-stofnanir/Nidurhal/Sta%C3%B0fangaskr%C3%A1%20eigindal%C3%BDsing.pdf.
The final set of columns/features we chose is the following (English translation of the column name to the right of the slash):
- FAERSLUNUMER/transaction_number: Property transaction number
- FASTNUM/property_number: Property identification number
- HEIMILISFANG/address: Address
- POSTNR_x/postal_code: Postal code
- SVEITARFELAG/municipality: Municipality
- THINGLYSTDAGS/notarized_date: Date of registration of the purchase agreement (notarized agreement)
- KAUPVERD/purchase_price: Purchase price of the property, given in thousand ISK
- FASTEIGNAMAT/property_value: Property value/assessment at the time of the purchase agreement, given in thousand ISK
- FASTEIGNAMAT_GILDANDI/current_property_value: Current property value/assessment, given in thousand ISK
- BRUNABOTAMAT_GILDANDI/insurance_payout: Current fire compensation assessment, i.e. insurance payout if the property is destroyed, given in thousand ISK
- BYGGAR/year_built: Year of construction
- EINFLM/size_square_meters: Unit area of the property
- LOD_FLM/plot_size: Plot size
- LOD_FLMEIN/plot_size_units: Plot size units (not all plots are given in the same units)
- FJHERB/number_of_rooms: Number of rooms
- TEGUND/type: Property type
- ONOTHAEFUR_SAMNINGUR/unenforcable_contract: 1 if the contract is unenforceable, otherwise 0
- N_HNIT_WGS84/latitude: North coordinate in WGS84 (latitude)
- E_HNIT_WGS84/longitude: East coordinate in WGS84 (longitude)
- FULLBUID/complete: 1 if the property is complete, otherwise 0

## Data cleaning and pre-processing
# List of features we want to work with
features = ['FAERSLUNUMER','FASTNUM','HEIMILISFANG','POSTNR_x','SVEITARFELAG','THINGLYSTDAGS','KAUPVERD','FASTEIGNAMAT'
,'FASTEIGNAMAT_GILDANDI','BRUNABOTAMAT_GILDANDI','BYGGAR','EINFLM','LOD_FLM','LOD_FLMEIN','FJHERB'
,'TEGUND','ONOTHAEFUR_SAMNINGUR','N_HNIT_WGS84','E_HNIT_WGS84', 'FULLBUID']
# Filtering on those columns
data = data[features]
For better code transparency we decided to translate the column names from Icelandic to English. We also translated the "types" of housing from Icelandic to English and dropped all rows with houses of type "Commercial property", "Other", "Garage/Shed" or "Summer house", as those were of less interest to us; we wanted to focus on properties the "average Joe" would buy as their main residence.
# Renaming columns - translating to English
data.rename(columns={
    'FAERSLUNUMER': 'transaction_number',
    'FASTNUM': 'property_number',
    'HEIMILISFANG': 'address',
    'POSTNR_x': 'postal_code',
    'SVEITARFELAG': 'municipality',
    'THINGLYSTDAGS': 'notarized_date',
    'KAUPVERD': 'purchase_price',
    'FASTEIGNAMAT': 'property_value',
    'FASTEIGNAMAT_GILDANDI': 'current_property_value',
    'BRUNABOTAMAT_GILDANDI': 'insurance_payout',
    'BYGGAR': 'year_built',
    'EINFLM': 'size_square_meters',
    'LOD_FLM': 'plot_size',
    'LOD_FLMEIN': 'plot_size_units',
    'FJHERB': 'number_of_rooms',
    'TEGUND': 'type',
    'ONOTHAEFUR_SAMNINGUR': 'unenforcable_contract',
    'HEINUM': 'designation_number',  # no-op: HEINUM was dropped in the feature selection above
    'N_HNIT_WGS84': 'latitude',
    'E_HNIT_WGS84': 'longitude',
    'FULLBUID': 'complete',
}, inplace=True)
# Translating the property types to English; unknown values pass through unchanged
type_translation = {
    'Fjölbýli': 'Apartment',
    'Sérbýli': 'Plex/semi/linked house',
    'Bílskúr/skúr': 'Garage/Shed',
    'Atvinnuhúsnæði': 'Commercial property',
    'Einbýli': 'Private house',
    'Sumarhús': 'Summer house',
    'Annað': 'Other',
}
data['type'] = data['type'].replace(type_translation)
# Dropping unwanted property types
data = data[~data['type'].isin(['Other', 'Garage/Shed', 'Commercial property', 'Summer house'])]
An interesting metric when looking into property data is the purchase price per square meter, so we add it as a new column to our data frame; the unit of measure is thousand ISK per $m^2$.
data['price_per_sq_meter'] = (data['purchase_price'] / data['size_square_meters']).round(4)
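One caveat worth flagging (a defensive sketch, not part of the original pipeline): the division above yields `inf` whenever `size_square_meters` is 0, and seaborn's histogram code complains about infinite values; replacing `inf` with `NaN` sidesteps that. A hypothetical mini-frame illustrates the edge case:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame (not the project data) with a zero-size property
df = pd.DataFrame({'purchase_price': [50.0, 30.0],
                   'size_square_meters': [100.0, 0.0]})

ratio = df['purchase_price'] / df['size_square_meters']    # second entry becomes inf
price_per_sq_meter = ratio.replace([np.inf, -np.inf], np.nan).round(4)  # inf -> NaN
```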
We then did the common conversion of the date column to datetime and the year built to integer, and stripped whitespace from all string columns. We also scaled the price and value columns, which were given in thousand ISK, to million ISK instead, since we are dealing with large numbers that usually run into millions.
# Converting the date column to datetime
data['notarized_date'] = pd.to_datetime(data['notarized_date'])
# Converting to nullable integers (invalid values coerce to <NA>)
data['year_built'] = pd.to_numeric(data['year_built'], errors='coerce').astype('Int64')
data['number_of_rooms'] = pd.to_numeric(data['number_of_rooms'], errors='coerce').astype('Int64')
# Selecting string columns and applying str.strip() to get rid of whitespace
string_columns = data.select_dtypes(include=[object])
for column in string_columns:
data[column] = data[column].str.strip()
# We decided to scale the price and value columns so they are given in million ISK instead of thousand
columns_to_scale = ['purchase_price', 'property_value', 'current_property_value', 'insurance_payout']
# Apply the scaling function to each column in the list
data[columns_to_scale] = data[columns_to_scale].apply(lambda x: x / 1000).round(2)
To work with relevant property data we decided to filter on unenforcable_contract = 0, since we only wanted usable records. Usable here means a contract/agreement that can be used to evaluate the market: a record believed to be correct and unbiased (an unusable contract would, for example, be parents selling a property to their children for far less than market value). There were also a few properties with postal code 611, which belongs to a remote island in the North of Iceland and was deemed an outlier in this analysis.
data = data.loc[data['unenforcable_contract'] == 0] # Filter out unusable records
data = data.loc[data['postal_code']!=611]
Finally, we decided to enrich our data with the eight regions of Iceland. We used this list https://is.wikipedia.org/wiki/Listi_yfir_%C3%ADslensk_p%C3%B3stn%C3%BAmer together with domain knowledge of Icelandic postal codes to map postal codes to regions. The postal codes of Höfuðborgarsvæðið and Suðurnes do not follow a clean numeric pattern, so they were inserted manually; the other regions were easier to map.
capital_region_postal_codes = [
101, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112, 113, 116, 121, 123, 124,
125, 127, 128, 129, 130, 132, 200, 201, 202, 203, 220, 221, 210, 270, 170, 225
]
sudurnes_postal_codes = [
190, 191, 230, 232, 233, 235, 240, 241, 245, 246, 250, 251, 260, 262
]
# Create a dictionary to map postal codes directly to their regions
postal_code_to_region = {pc: "IS1" for pc in capital_region_postal_codes}
postal_code_to_region.update({pc: "IS2" for pc in sudurnes_postal_codes})
for pc in range(300, 400):
postal_code_to_region[pc] = "IS3"
for pc in range(400, 500):
postal_code_to_region[pc] = "IS4"
for pc in range(500, 600):
postal_code_to_region[pc] = "IS5"
for pc in range(600, 700):
postal_code_to_region[pc] = "IS6"
for pc in range(700, 800):
postal_code_to_region[pc] = "IS7"
for pc in range(800, 901):
postal_code_to_region[pc] = "IS8"
# Function to map postal codes to regions
def map_postal_code_to_region(postal_code):
return postal_code_to_region.get(postal_code, "Unknown")
# Apply the mapping function to the postal code column
data['region'] = data['postal_code'].apply(map_postal_code_to_region)
# Map region codes to names
region_names = {
"IS1": "Höfuðborgarsvæði",
"IS2": "Suðurnes",
"IS3": "Vesturland",
"IS4": "Vestfirðir",
"IS5": "Norðurland vestra",
"IS6": "Norðurland eystra",
"IS7": "Austurland",
"IS8": "Suðurland"
}
# Replace region codes with names
data['region_name'] = data['region'].map(region_names)
data.head()
| transaction_number | property_number | address | postal_code | municipality | notarized_date | purchase_price | property_value | current_property_value | insurance_payout | year_built | size_square_meters | plot_size | plot_size_units | number_of_rooms | type | unenforcable_contract | latitude | longitude | complete | price_per_sq_meter | region | region_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 679963 | 2067730 | Melabraut 3 | 170 | Seltjarnarnesbær | 2021-10-25 15:18:50 | 50.0 | 62.25 | 100.4 | 67.90 | 1963 | 139.8 | 820.0 | m² | 2 | Apartment | 0 | 64.152289 | -21.999498 | 1 | 357.6538 | IS1 | Höfuðborgarsvæði |
| 2 | 680332 | 2067729 | Melabraut 3 | 170 | Seltjarnarnesbær | 2021-11-04 15:14:48 | 44.0 | 48.85 | 90.6 | 53.95 | 1963 | 102.3 | 820.0 | m² | 3 | Apartment | 0 | 64.152289 | -21.999498 | 1 | 430.1075 | IS1 | Höfuðborgarsvæði |
| 9 | 543572 | 2017735 | Hjallavegur 28 | 104 | Reykjavíkurborg | 2008-02-07 13:22:27 | 14.0 | 13.22 | 41.8 | 24.10 | 1949 | 55.0 | 512.8 | m² | 1 | Apartment | 0 | 64.144457 | -21.860799 | 1 | 254.5455 | IS1 | Höfuðborgarsvæði |
| 10 | 543573 | 2017736 | Hjallavegur 28 | 104 | Reykjavíkurborg | 2008-02-07 13:24:57 | 81.0 | 36.79 | 99.6 | 79.80 | 1949 | 203.6 | 512.8 | m² | 4 | Apartment | 0 | 64.144457 | -21.860799 | 1 | 397.8438 | IS1 | Höfuðborgarsvæði |
| 11 | 639835 | 2017735 | Hjallavegur 28 | 104 | Reykjavíkurborg | 2019-01-12 10:50:49 | 35.9 | 27.15 | 41.8 | 24.10 | 1949 | 55.0 | 512.8 | m² | 1 | Apartment | 0 | 64.144457 | -21.860799 | 1 | 652.7273 | IS1 | Höfuðborgarsvæði |
data.shape
(137415, 23)
Above is the cleaned and processed dataset; after the filtering and pre-processing we have 137,415 records. We will analyze the dataset further in the coming sections.
To enrich our data we needed other supporting datasets, and this small section covers them. Some are less exciting, like the consumer price index, used to adjust prices for inflation in the histogram plots later on (the data was found through this link: https://px.hagstofa.is/pxis/pxweb/is/Efnahagur/Efnahagur__visitolur__1_vnv__1_vnv/VIS01000.px).
Then we had two different interest rate series, indexed and non-indexed. Here is a link to the rate data: https://www.sedlabanki.is/annad-efni/meginvextir-si/. We then did some minor transformation and cleaning of the interest rate data, as seen below.
# Reading in the csv files with the interest rates
interest_rates = pd.read_csv('interest_rates_banks.csv', sep=';', header=0)
policy_rates = pd.read_csv('policy_interest_rates.csv', sep=';', header=0)
# To datetime
interest_rates['Date'] = pd.to_datetime(interest_rates['Date'], format='mixed')
policy_rates['Date'] = pd.to_datetime(policy_rates['Date'], format='mixed')
# Dropping unwanted column
interest_rates.drop(columns=['Unnamed: 5'], inplace=True)
# Replacing ',' with '.' so the values are in line with other units, then converting to numeric
rate_columns = ['non-indexed rates - lowest', 'non-indexed rates - highest',
                'indexed rates - lowest', 'indexed rates - highest']
for col in rate_columns:
    interest_rates[col] = pd.to_numeric(interest_rates[col].str.replace(',', '.'))
policy_rates['Policy interest rates'] = pd.to_numeric(
    policy_rates['Policy interest rates'].str.replace(',', '.'))
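As an aside, the comma-to-dot replacement can also be avoided entirely: `pd.read_csv` accepts a `decimal` parameter that parses comma-decimal columns straight to floats at read time. A sketch with inline data (hypothetical column names, not the actual rate files):

```python
import io
import pandas as pd

# Inline sample mimicking a semicolon-separated file with comma decimals
csv_text = "Date;rate\n2020-01-01;4,25\n2020-02-01;4,50\n"
rates = pd.read_csv(io.StringIO(csv_text), sep=';', decimal=',')
# 'rate' is parsed directly to float64, no str.replace step needed
```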
Lastly we found more fun data, namely shapefile data for the eight regions of Iceland. We had already mapped the regions into our dataset and then found the shapefiles for them at the following data source: https://simplemaps.com/gis/country/is#admin1. For a quick visualization of the regions we plotted a simple choropleth of the distribution of contracts across regions, and to no surprise the capital region (Höfuðborgarsvæðið) dominates the housing market.
zip_file_path = 'is.zip'
geojson_file_name = 'is.json'
with zipfile.ZipFile(zip_file_path, 'r') as z:
# Extract the GeoJSON file from the zip
with z.open(geojson_file_name) as file:
# Load the GeoJSON file as usual
iceland_geojson = json.load(file)
# Calculate the region_count
region_count = data.groupby('region_name')['transaction_number'].size().reset_index()
region_count.columns = ['region', 'count']
fig = px.choropleth_mapbox(region_count, geojson=iceland_geojson, locations='region', color='count',
featureidkey="properties.name", # Adjust this to match the key in your GeoJSON
color_continuous_scale="RdBu_r",
range_color=(region_count['count'].min(), region_count['count'].max()),
mapbox_style="carto-positron",
zoom=5, center = {"lat": 64.9631, "lon": -19.0208}, # Center on a relevant point
opacity=0.5,
labels={'count':'Number of contracts'}
)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
pyo.iplot(fig)
In this section we wanted to understand in more detail the distribution of the dataset across different time periods, municipalities and other dimensions. We also wanted to understand the basic statistics of the dataset as well as its geographical representation.
We started by looking at the time horizon of our historical property contracts and the number of municipalities in Iceland, discovering that the data runs from May 2006 to March 2024 and covers 62 different municipalities and 157 different postal codes (after some pre-processing).
print('Oldest contract is from', data['notarized_date'].min().strftime('%Y-%m-%d'))
print('Newest contract is from', data['notarized_date'].max().strftime('%Y-%m-%d'))
mun = data['municipality'].unique()
print(f'\nThere are {len(mun)} municipalities in Iceland')
pc = data['postal_code'].unique()
print(f'\nThere are {len(pc)} different postal codes in Iceland')
Oldest contract is from 2006-05-09
Newest contract is from 2024-03-20
There are 62 municipalities in Iceland
There are 157 different postal codes in Iceland
We then created a simple bar chart of the historical distribution of contracts per year. The plot below reveals a spike in 2007 followed by a drastic drop in 2008, most likely caused by the 2008 financial crisis. From 2009 up until 2021 there is an upward trend, but in the last two years the housing market seems to have cooled down from the record year of 2021.
# Extract the Year temporarily and count records, excluding 2024
yearly_counts = data['notarized_date'].dt.year[data['notarized_date'].dt.year != 2024].value_counts()
# Sort the years
yearly_counts = yearly_counts.sort_index()
# Plot the data
plt.figure(figsize=(10, 6)) # Set the figure size for better readability
yearly_counts.plot(kind='bar', color='skyblue') # Create a bar chart
plt.title('Number of contracts per Year')
plt.xlabel('Year')
plt.ylabel('Number of contracts')
plt.xticks(rotation=90) # Rotate x-axis labels for better readability
plt.savefig('number_of_contracts_per_year.png')
plt.show()
Then we plotted a bar chart showing how many historical property contracts have been made in each municipality.
# Extract the municipalities data
municipalities_counts = data['municipality'].value_counts()
# The data is already sorted by count in descending order due to value_counts()
# Plot the data
plt.figure(figsize=(10, 6)) # Set the figure size for better readability
# Create a bar chart with thicker bars (default width is 0.8, so try 1 for thicker bars)
municipalities_counts.plot(kind='bar', color='skyblue', width=1)
plt.title('Number of Contracts per Municipality')
plt.xlabel('Municipality')
plt.ylabel('Number of Contracts')
plt.xticks(rotation=90) # Rotate x-axis labels for better readability
plt.show()
No surprise that Reykjavíkurborg, the capital of Iceland, has by far the greatest number of contracts, while smaller towns in the countryside have very few in comparison. By looking at the graph, and applying domain knowledge as "experts" in being from Iceland, we chose our "focus" municipalities to be the top 16 by number of contracts. Most of them are in the capital region, but big towns in the Eastern and Northern parts of Iceland are also included.
mun_count = data.groupby('municipality').size().reset_index(name='Count').sort_values(by='Count', ascending=False)
focus_municipalities = mun_count.head(16)['municipality'].tolist()
data_mun = data.loc[data['municipality'].isin(focus_municipalities)]
To further understand the distribution of our dataset we decided to plot a bar chart showing the number of purchase agreements within our focus municipalities.
# Extract the municipalities data
municipalities_counts = data_mun['municipality'].value_counts()
# The data is already sorted by count in descending order due to value_counts()
# Plot the data
plt.figure(figsize=(10, 6)) # Set the figure size for better readability
municipalities_counts.plot(kind='bar', color='skyblue') # Create a bar chart
plt.title('Number of Contracts per Municipality')
plt.xlabel('Municipality')
plt.ylabel('Number of Contracts')
plt.xticks(rotation=90) # Rotate x-axis labels for better readability
plt.savefig('number_of_contracts_per_mun.png')
plt.show()
Looking at the number of contracts per municipality plot, we see the distribution over our focus municipalities in greater detail: the top 3 are in the capital region, in 4th place is Akureyrarbær, the largest town in the North of Iceland, and in 5th place is Reykjanesbær, located on the Southern Peninsula of Iceland.
It also made sense to plot the distribution of historical housing prices using a histogram, and to calculate the inflation-adjusted price using the consumer price index (CPI) from Statistics Iceland.
# Reading the CPI data
cpi_data = pd.read_csv('cpi_data.csv', sep=';', header=0)
cpi_data['Date'] = pd.to_datetime(cpi_data['Date'].str.replace('M', ''), format='%Y%m')
# Create new columns for year and month
data['year_month'] = data['notarized_date'].dt.to_period('M')
cpi_data['year_month'] = cpi_data['Date'].dt.to_period('M')
# Merge the DataFrames on year_month
data_inflation = pd.merge(data, cpi_data, on='year_month')
#Convert to float
data_inflation['CPI'] = data_inflation['CPI'].astype(float)
newest_date = data_inflation['notarized_date'].max()
# Base CPI value, the newest
cpi_value = data_inflation.loc[data_inflation['notarized_date'] == newest_date, 'CPI'].values[0]
data_inflation['purchase_price_inf_adj'] = data_inflation['purchase_price'] * (1 + ((cpi_value - data_inflation['CPI'])/data_inflation['CPI']))
# Plotting with adjusted prices
plt.figure(figsize=(10, 6))
sns.histplot(data_inflation['purchase_price_inf_adj'], kde=True, bins=200)
plt.title('Distribution of Purchase Prices - Adjusted for Inflation using CPI indexes')
plt.xlabel('Purchase Prices adjusted for inflation')
plt.ylabel('Density')
plt.show()
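The adjustment factor used above, `1 + (CPI_base - CPI_t) / CPI_t`, algebraically reduces to the ratio `CPI_base / CPI_t`. A tiny check with made-up CPI numbers (not real data) confirms both forms agree:

```python
# Made-up CPI values: newest (base) CPI and the CPI at the contract date
cpi_base, cpi_old = 620.0, 310.0
price = 40.0  # hypothetical purchase price in million ISK

adjusted = price * (1 + (cpi_base - cpi_old) / cpi_old)  # form used in the notebook
simplified = price * cpi_base / cpi_old                  # equivalent ratio form
# here both evaluate to 80.0: prices double when the CPI has doubled
```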
Hmm... the plot above is not that nice to look at, but there seem to be some absurdly expensive properties, perhaps not outliers in the strict sense, but nevertheless out of the ordinary. If we only look at purchase prices below 200 million ISK we should get a better feeling for the distribution of the purchase prices.
# Filter the data to include only purchase prices <= 200 m ISK
filtered_data = data_inflation[data_inflation['purchase_price_inf_adj'] <= 200]
# Plotting with adjusted prices
plt.figure(figsize=(10, 6))
sns.histplot(filtered_data['purchase_price_inf_adj'], kde=True, bins=200)
plt.title('Distribution of Purchase Prices - Adjusted for Inflation using CPI indexes')
plt.xlabel('Purchase prices adjusted for inflation')
plt.ylabel('Density')
plt.show()
Well, of course there are always some properties that go way above the market trend... but here we have a fairly nice Gaussian-like bell, albeit a little skewed with a long right tail coming from the more expensive properties.
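To put a number on "a little skewed", pandas can compute sample skewness directly with `Series.skew()`. A sketch on synthetic log-normal prices (hypothetical data, right-skewed like housing prices, rather than the real column):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" drawn from a log-normal distribution
rng = np.random.default_rng(42)
synthetic_prices = pd.Series(rng.lognormal(mean=3.6, sigma=0.5, size=10_000))

skewness = synthetic_prices.skew()  # positive value => long right tail
```

On the real `purchase_price_inf_adj` column the same one-liner applies.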
It also makes sense to look at a scatter plot of square meters against purchase price (adjusted for inflation). Our guess is that price increases with size, but the relationship is probably quite location dependent (where in Iceland the property lies), so not linear across the whole dataset.
# Scatter plot of purchase price adjusted to inflation to the square meters of the property
filter_data = data_inflation[(data_inflation['size_square_meters'] <= 500) & (data_inflation['notarized_date'] >= '2020-01-01')]
plt.figure(figsize=(10, 6))
plt.scatter(filter_data['size_square_meters'], filter_data['purchase_price_inf_adj'], alpha=0.5, color='blue')
plt.title('Purchase Price vs. Square Meters')
plt.xlabel('Square Meters')
plt.ylabel('Purchase Price (Inflation Adjusted)')
plt.grid(False)
plt.show()
Yes, there is some upward trend, but the data is really clustered and there is not much insight to gain here, other than that purchase prices can differ widely for properties of roughly the same size. Only the biggest properties, past 350 square meters, seem to follow a clearer upward trend across the whole dataset. This gave us the direction of exploring whether this holds for all locations (regions/postal codes) or whether specific locations can be differentiated.
# Define colors for each region
colors = ['blue', 'green', 'red', 'orange', 'purple', 'pink', 'brown', 'black', 'grey'] # Add more colors if needed
filter_data = data_inflation[(data_inflation['size_square_meters'] <= 500) & (data_inflation['notarized_date'] >= '2020-01-01')]
plt.figure(figsize=(10, 6))
# Iterate over unique regions
for i, region in enumerate(filter_data['region_name'].unique()):
region_data = filter_data[filter_data['region_name'] == region]
plt.scatter(region_data['size_square_meters'], region_data['purchase_price_inf_adj'], alpha=0.5, label=region, color=colors[i])
plt.title('Purchase Price vs. Square Meters')
plt.xlabel('Square Meters')
plt.ylabel('Purchase Price (Inflation Adjusted)')
plt.grid(False)
# Show legend
plt.legend()
plt.show()
Yes, much better: clearly, the Capital Region (Höfuðborgarsvæðið) is much more expensive than the other regions. To reduce overplotting, we also drew the same scatter on a random sample of 1,000 points.
# Define colors for each region
colors = ['blue', 'green', 'red', 'orange', 'purple', 'pink', 'brown', 'black', 'grey'] # Add more colors if needed
filter_data = data_inflation[(data_inflation['size_square_meters'] <= 500) & (data_inflation['notarized_date'] >= '2020-01-01')]
# Sample 1,000 data points to reduce overplotting
filter_data_sample = filter_data.sample(n=1000, random_state=42)
plt.figure(figsize=(10, 6))
# Iterate over unique regions
for i, region in enumerate(filter_data_sample['region_name'].unique()):
region_data = filter_data_sample[filter_data_sample['region_name'] == region]
plt.scatter(region_data['size_square_meters'], region_data['purchase_price_inf_adj'], alpha=0.5, label=region, color=colors[i])
plt.title('Purchase Price vs. Square Meters')
plt.xlabel('Square Meters')
plt.ylabel('Purchase Price (Inflation Adjusted)')
plt.grid(False)
# Show legend
plt.legend()
plt.show()
Since we have latitude and longitude data for every registered property, we wanted to visualize which regions of Iceland have higher density. We did this by drawing a dot for each contract, with the home address as a popup. We arbitrarily chose to display all contracts from 2021, just to get a more high-level overview.
# Filter data between 2021-01-01 and 2021-12-31
map_data = data[(data['notarized_date'] >= '2021-01-01') & (data['notarized_date'] <= '2021-12-31')][['address', 'latitude', 'longitude']]
iceland_map = folium.Map(location=[64.9631, -19.0208], zoom_start=6, tiles='CartoDB Positron')
# Add smaller markers/dots for incidents
for index, row in map_data.iterrows():
folium.CircleMarker(
location=[row['latitude'], row['longitude']],
radius=1, # Adjust the radius as needed
color='red',
fill=True,
fill_color='red',
popup=row['address']
).add_to(iceland_map)
iceland_map